TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping
نویسندگان
چکیده
As high-throughput sequencers become standard equipment outside of sequencing centers, there is an increasing need for efficient methods for pre-processing and primary analysis. While a vast literature proposes methods for HTS data analysis, we argue that significant improvements can still be gained by exploiting expensive pre-processing steps which can be amortized with savings from later stages. We propose a method to accelerate and improve read mapping based on an initial clustering of possibly billions of high-throughput sequencing reads, yielding clusters of high stringency and a high degree of overlap. This clustering improves on the state-of-the-art in running time for small datasets and, for the first time, makes clustering high-coverage human libraries feasible. Given the efficiently computed clusters, only one representative read from each cluster needs to be mapped using a traditional readmapper such as BWA, instead of individually mapping all reads. On human reads, all processing steps, including clustering and mapping, only require 11%–59% of the time for individually mapping all reads, achieving speed-ups for all readmappers, while minimally affecting mapping quality. This accelerates a highly sensitive readmapper such as Stampy to be competitive with a fast readmapper such as BWA on unclustered reads.
منابع مشابه
Fast and accurate mapping of Complete Genomics reads.
Many recent advances in genomics and the expectations of personalized medicine are made possible thanks to power of high throughput sequencing (HTS) in sequencing large collections of human genomes. There are tens of different sequencing technologies currently available, and each HTS platform have different strengths and biases. This diversity both makes it possible to use different technologie...
متن کاملIndel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees
MOTIVATION Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (in...
متن کاملA Non-volatile Near-Memory Read Mapping Accelerator
DNA sequencing is the physical or biochemical process of identifying the location of the four bases (Adenine, Guanine, Cytosine, Thymine) in a DNA strand. As semiconductor technology revolutionized computing, DNA sequencing technology (termed Next Generation Sequencing, NGS) revolutionized genomic research. Modern NGS platforms can sequence hundreds of millions of short DNA fragments in paralle...
متن کاملDesigning Efficient Spaced Seeds for SOLiD Read Mapping
The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications.We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the ...
متن کاملSNP-o-matic
MOTIVATION High throughput sequencing technologies generate large amounts of short reads. Mapping these to a reference sequence consumes large amounts of processing time and memory, and read mapping errors can lead to noisy or incorrect alignments. SNP-o-matic is a fast, memory-efficient and stringent read mapping tool offering a variety of analytical output functions, with an emphasis on genot...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1404.2872 شماره
صفحات -
تاریخ انتشار 2014